Training Set Construction Methods

Author

  • Tomas Borovicka
Abstract

To build a classification or regression model, learning algorithms use datasets to set the model's parameters and to estimate its performance. Training set construction is a part of data preparation, an important phase that is often underestimated in the data mining process. However, choosing appropriate preprocessing algorithms is often as important as choosing a suitable learning algorithm. The goal of training set construction algorithms is to build representative datasets by discarding useless instances and reinforcing important ones. A good-quality training set is a prerequisite for building a well-trained and reliable model. Much literature has been published comparing learning algorithms and regression or classification models, but a thorough review and comparison of training set construction methods has not yet been given. This work focuses on how to select data samples from an original set and place them into the training and testing sets. The first part gives an overview of existing approaches and discusses possible new ones. The second part is an experimental comparison of these methods.
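As a point of reference for what placing samples into training and testing sets involves, the simplest baseline is a stratified random holdout split. The sketch below is not taken from the paper: the function name `stratified_split`, the `test_fraction` and `seed` parameters, and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.25, seed=0):
    """Split (sample, label) pairs so that each class keeps roughly the
    same proportion in the training and testing sets.

    Hypothetical helper for illustration; not the paper's method.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random selection within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


# Toy data (assumed): 8 samples of class "a" and 4 of class "b".
samples = list(range(12))
labels = ["a"] * 8 + ["b"] * 4
train, test = stratified_split(samples, labels, test_fraction=0.25)
```

Methods that discard uninformative instances or reinforce important ones would replace the random shuffle inside each class with a selection rule of their own.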

Similar resources

Data construction for phosphorylation site prediction

Protein phosphorylation is one of the most pervasive post-translational modifications, regulating diverse cellular processes in various organisms. As mass spectrometry-based experimental approaches for identifying phosphorylation events are resource-intensive, many computational methods have been proposed, in which phosphorylation site prediction is formulated as a classification problem. They ...

Improvement of Predictive Ability by Uniform Coverage of the Target Genetic Space

Genome enabled prediction provides breeders with the means to increase the number of genotypes that can be evaluated for selection. One of the major challenges in genome enabled prediction is how to construct a training set of genotypes from a calibration set that represents the target population of genotypes, where the calibration set is composed of a training and validation set. A random samp...

Collaborative Training of Tensors for Compositional Distributional Semantics

Type-based compositional distributional semantic models present an interesting line of research into functional representations of linguistic meaning. One of the drawbacks of such models, however, is the lack of training data required to train each word-type combination. In this paper we address this by introducing training methods that share parameters between similar words. We show that these...

Optimizing Training Set Construction for Video Semantic Classification

We exploit the criteria to optimize training set construction for the large-scale video semantic classification. Due to the large gap between low-level features and higher-level semantics, as well as the high diversity of video data, it is difficult to represent the prototypes of semantic concepts by a training set of limited size. In video semantic classification, most of the learning-based ap...

An Incremental Learning Algorithm That Optimizes Network Size and Sample Size in One Trial

A constructive learning algorithm is described that builds a feedforward neural network with an optimal number of hidden units to balance convergence and generalization. The method starts with a small training set and a small network, and expands the training set incrementally after training. If the training does not converge, the network grows incrementally to increase its learning capacity....

Publication date: 2013